ABSTRACT

Evaluation of recommender systems is typically done with
finite datasets. This means that conventional evaluation
methodologies are only applicable in offline experiments,
where data and models are stationary. However, in real
world systems, user feedback is continuously generated, at
unpredictable rates. Given this setting, one important issue
is how to evaluate algorithms in such a streaming data
environment. In this paper we propose a prequential evaluation
protocol for recommender systems, suitable for streaming
data environments, but also applicable in stationary
settings. Using this protocol we are able to monitor the
evolution of algorithms’ accuracy over time. Furthermore, we
are able to perform reliable comparative assessments of
algorithms by computing significance tests over a sliding window.
We argue that besides being suitable for streaming data,
prequential evaluation allows the detection of phenomena that
would otherwise remain unnoticed in the evaluation of both
offline and online recommender systems.
1. INTRODUCTION

Usage-based algorithms for recommender systems rely on
user-provided data. In a typical lab setting, this data has
been previously collected from a system and consists of a
finite set of user generated actions - typically item ratings.
These datasets contain enough data to objectively apply
well studied methodologies to evaluate recommendation
algorithms in a laboratory setting. However, there is a growing
consensus that the accuracy obtained by algorithms in
controlled environments does not translate directly into good
performance or overall user-perceived quality in a real-world
production environment.

We use the words “online” and “offline” to refer to the
environment in which a system functions and/or is evaluated.
Online systems run in production mode in a real-world
setting - i.e. they provide or support an active online service
to real users. Offline systems essentially run in laboratory
or controlled conditions, mainly for development and/or
systematic evaluation purposes.

In this paper, we propose the application of a prequential
evaluation methodology [8] for the evaluation of
recommender systems. Prequential evaluation is primarily
designed to evaluate algorithms that learn from continuous
flows of data - data streams. If we look at user-generated
data typically used by recommendation systems, we can
safely state that this user feedback is generated online at
unpredictable rates and ordering, and is potentially unbounded.
This is the exact definition of a data stream [2]. On the
one hand, this motivates the use of incremental algorithms,
since batch rebuilding of predictive models may eventually
become too expensive. On the other hand, it also motivates a
reflection on the applicability of classic evaluation
methodologies to non-stationary models. This is because we are
no longer trying to evaluate one model within a well
delimited time frame, but rather a continuous and ever
unfinished learning process. Prequential evaluation is especially
adequate for this kind of setting. Nevertheless, it is also
applicable offline, with static datasets, as illustrated in this
paper (Sec. 3).

Holdout methods are widely used in the evaluation of
recommender systems; however, they are designed for batch
algorithms and are not directly applicable in a non-stationary
setting. Indeed, if data points are constantly being
generated, we can only take subsets of the available data and
evaluate the algorithms on those subsets. Moreover, if we
decide to implement an incremental algorithm, some issues
in its evaluation, such as dataset ordering or recommendation
bias, are not easy to circumvent (see Sec. 2).

Prequential evaluation does not require data pre-processing
and is not restrictive in terms of evaluation criteria. It may
include accuracy metrics typically used offline - e.g.
precision/recall, RMSE - but also allows online measurements
of complex user interaction behavior or acceptance feedback,
thereby including actual users in the evaluation process. By
collecting diverse statistics, it is also possible to combine
several important dimensions of the evaluation of
recommender systems, such as novelty, serendipity, diversity, trust
and coverage [20], or to collect online A/B testing feedback
data [13]. We illustrate the use of prequential evaluation by
observing a simple accuracy metric over time for the
comparison of three algorithms, along with pairwise statistical
significance tests.

To our knowledge, prequential evaluation has only been used
for recommendation algorithms very recently, in our work
[23] and in [22]. The first essentially uses the evaluation
process described in this paper, and is a more direct
application of the prequential methodology used in data stream
mining. The second proposes a hybrid method that uses
both holdouts and prequential evaluation in mini-batches.
While in [23] our focus is essentially on the proposed
algorithm, this paper focuses on the evaluation methodology
itself in more detail, with the intent to raise discussion
on the evaluation issues of incremental algorithms. We also
illustrate the applicability of statistical significance tests to
compare algorithms over time.

The remainder of this paper is structured as follows. In
Sec. 2 we describe the traditional batch evaluation
methodologies. Prequential evaluation is described in Sec. 3. We
present an illustrative evaluation of three incremental
recommendation algorithms in Sec. 4. Finally we conclude in
Sec. 5.

2. EVALUATION METHODOLOGIES 

Traditionally, holdout methods are used in the batch
evaluation of recommender systems. They begin by splitting
the ratings dataset into two subsets - training set and
testing set - randomly choosing data elements from the initial
dataset. The training set is initially fed to the recommender
algorithm to build a predictive model.

There is a variety of offline protocols to evaluate
accuracy; however, they are essentially variations of holdout
strategies. Generally, these protocols “hide” a subset of
ratings given by each user in the test set. These hidden
interactions form a hidden set. Algorithms are evaluated by
measuring the difference between predictions and the actual
observations in the hidden set.

2.1 Issues with batch evaluation 

Given the described offline evaluation methodology we
identify the following issues:




• Dataset ordering: randomly selecting data for training
and test, as well as random hidden set selection,
shuffles the natural sequence of the data. Algorithms
designed to deal with naturally ordered data cannot
be rigorously evaluated if datasets are shuffled. One
straightforward solution is simply not to shuffle data,
that is, to pick a moment in time or a number of ratings
in the dataset as the split point. All ratings given
before the split point are used to train the model and
all subsequent ratings are used as testing data. One
difficulty with this approach is how to select the
hidden set. In [20] and 1^ the authors suggest that 
all ratings in the test set should be hidden; 

• Time awareness: shuffling data potentially breaks the
logic of time-aware algorithms, for example, by using
future ratings to predict past ratings. This issue may
as well be solved by keeping the chronological order of
data;

• Incremental updates: incremental algorithms perform
incremental updates of their models as new data points
become available. This means that neither models nor
training and test data are static. Models are
continuously being readjusted with new data. As far as we
know to this date, the only contributions in the field
of recommender systems that explicitly address this
issue are [23] and [22]. This issue has already been
addressed in the field of data stream mining [8, 9];

• Session grouping: most natural datasets, given their
unpredictable ordering, require some pre-processing to 
group ratings either by user or user session in order 
to use offline protocols. As data points accumulate, 
it eventually may become too expensive to re-group 
them. This is true also for any other kind of data 
pre-processing task; 

• Recommendation bias: in online production systems,
user behavior is - at least expectedly - influenced by
recommendations themselves. It is reasonable to
assume, for instance, that recommended items are more
likely to be followed than if they were not recommended.
Simulating this offline usually requires complicated user
behavior modeling which can be expensive and prone
to systematic error. One way to evaluate the actual
impact of a recommender system is to conduct user
surveys and/or A/B testing [13, 5, 12, 18].
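
The order-preserving split mentioned under dataset ordering and time awareness can be sketched as follows. This is only an illustration under stated assumptions: the event layout `(user, item, timestamp)` and the function name `chronological_split` are hypothetical, and the split point is given as a number of ratings.

```python
from typing import List, Tuple

# Hypothetical event layout: (user, item, timestamp).
Event = Tuple[str, str, float]


def chronological_split(events: List[Event], n_train: int):
    """Split events at a fixed number of ratings, keeping time order.

    Everything before the split point is training data and everything
    after it is test data; no shuffling takes place.
    """
    ordered = sorted(events, key=lambda event: event[2])
    return ordered[:n_train], ordered[n_train:]


events = [
    ("u1", "a", 3.0),
    ("u2", "b", 1.0),
    ("u1", "c", 2.0),
    ("u3", "d", 4.0),
    ("u2", "e", 5.0),
]
train, test = chronological_split(events, n_train=3)
```

With such a split, no rating in the test set precedes a rating in the training set, so a time-aware algorithm is never asked to predict the past from the future.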


The above limitations, along with other known issues [20, 16,
11], weaken the assumption that user behavior can be
accurately modeled or reproduced in offline experiments. From
a business logic perspective [7], offline evaluation may also
not be timely enough to support decision making. These
issues motivate the research of alternative or complementary
evaluation methodologies.

2.2 Offline evaluation protocols and metrics 

One important consideration about the evaluation of a
recommendation algorithm is the type of problem or task being
approached. When dealing with explicit numeric ratings,
the first task of the algorithm is to accurately predict
unknown ratings. This is usually referred to as a rating
prediction task and is most naturally seen as a regression problem.
One way to assess the accuracy of rating prediction
algorithms is to measure the error of predicted ratings, given
the true values in the hidden set [21], using metrics such
as Mean Absolute Error (MAE) and Root Mean Squared
Error (RMSE). This protocol is in fact the most common
approach, having been used in highly popularized
competitions such as the Netflix prize [1] and KDD-Cup 2011 [6].
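
As a brief illustration of the two error metrics named above, the following sketch computes MAE and RMSE over a hidden set. The function names and the example ratings are made up for illustration only.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and true ratings."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Made-up example: three hidden ratings and their predictions.
hidden_ratings = [5.0, 3.0, 3.0]
predictions = [4.0, 3.0, 5.0]
print(mae(predictions, hidden_ratings))   # 1.0
print(rmse(predictions, hidden_ratings))  # sqrt(5/3), about 1.29
```

RMSE penalizes the single large error (2 rating points) more heavily than MAE does, which is why the two values differ here.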

However, numeric ratings may not be available. In such
cases, the data usually consists of a record of positive-only
user-item interactions. The task is then to predict good
items to recommend. This problem is usually referred to
as item prediction. Item prediction problems can be
evaluated both as classification and ranking problems. Accuracy
is measured by matching recommendation lists to the true
hidden items for each user. Typically, classification metrics
such as Precision, Recall and F-measure or ranking metrics
such as Mean Average Precision (MAP) [15] or Normalized
Discounted Cumulative Gain (NDCG) [24] are used. The
first protocols used for the evaluation of item
recommendation problems are the ones known as All-but-N and Given-N
[2]. The All-but-N protocol hides exactly N items from each
user in the test set. One popular sub-protocol is the
All-but-One protocol, which hides exactly one item from each user
in the test set. The Given-N protocol keeps exactly N items
in the test set and hides all others.
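
The two hiding protocols and the matching of a recommendation list against the hidden items can be sketched as below. This is a sketch under assumptions: hidden items are chosen at random for each user, and all function names are hypothetical.

```python
import random


def all_but_n(user_items, n, rng):
    """All-but-N: hide exactly n items, keep the rest observed."""
    items = list(user_items)
    rng.shuffle(items)
    return items[n:], items[:n]  # (observed, hidden)


def given_n(user_items, n, rng):
    """Given-N: keep exactly n items observed, hide all others."""
    items = list(user_items)
    rng.shuffle(items)
    return items[:n], items[n:]  # (observed, hidden)


def precision_at_n(recommended, hidden, n):
    """Fraction of the top-n recommendations that are hidden items."""
    return len(set(recommended[:n]) & set(hidden)) / n


def recall_at_n(recommended, hidden, n):
    """Fraction of the hidden items found in the top-n recommendations."""
    if not hidden:
        return 0.0
    return len(set(recommended[:n]) & set(hidden)) / len(hidden)


rng = random.Random(0)
observed, hidden = all_but_n(["a", "b", "c", "d", "e"], 2, rng)
```

All-but-One is the special case `all_but_n(items, 1, rng)`.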


3. PREQUENTIAL EVALUATION 

Given the problems listed in Section 2.1, we propose a
prequential approach [8], especially suited for the evaluation of
algorithms in a non-stationary environment. Essentially, the
prequential method consists of a test-then-learn procedure
that runs for each new data point. Given a newly observed
data point, a prediction is made and tested - e.g. measuring
error. Then, the data point is used to update the model.

In this paper, we illustrate prequential evaluation with item
prediction recommenders. The item prediction task consists
of selecting good items for recommendation, which are
typically presented to the user as a ranked list. In this task,
for each observed event <u, i>, in which user u interacts
with item i, prequential evaluation consists of the following
steps:

1. If u is a known user, use the current model to
recommend N items to u, otherwise go to step 3;

2. Score the recommendation list given the true observed
item i;

3. Update the model with the observed event;

4. Proceed to the next event in the dataset.


In its strict formulation, prequential evaluation does not
require step 3. Indeed, one may not wish to update the model
at every single observation, or ever. This allows the
comparison between different types of algorithms, for example,
incremental vs. batch algorithms.
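
The four steps listed above can be sketched as a short test-then-learn loop. This is a minimal illustration, not the implementation used in the experiments: `PopularityModel` and its methods `knows`, `recommend` and `update` are hypothetical names, and the toy popularity recommender only stands in for an incremental algorithm. The score is 1 when the observed item appears in the top-N list and 0 otherwise.

```python
from collections import Counter


class PopularityModel:
    """Toy incremental recommender (illustrative stand-in only)."""

    def __init__(self):
        self.item_counts = Counter()
        self.seen = {}  # user -> set of items already observed

    def knows(self, user):
        return user in self.seen

    def recommend(self, user, n):
        # Rank items by popularity, skipping what the user already has.
        already = self.seen.get(user, set())
        ranked = [item for item, _ in self.item_counts.most_common()
                  if item not in already]
        return ranked[:n]

    def update(self, user, item):
        self.item_counts[item] += 1
        self.seen.setdefault(user, set()).add(item)


def prequential(events, model, n=10, learn=True):
    """Run test-then-learn over a sequence of (user, item) events."""
    scores = []
    for user, item in events:               # step 4: next event
        if model.knows(user):               # step 1: recommend N items
            recommended = model.recommend(user, n)
            scores.append(1.0 if item in recommended else 0.0)  # step 2
        if learn:
            model.update(user, item)        # step 3: update the model
    return scores


events = [("u1", "a"), ("u2", "a"), ("u1", "b"),
          ("u2", "b"), ("u3", "a"), ("u2", "c")]
scores = prequential(events, PopularityModel(), n=1)
```

Passing `learn=False` skips step 3, which corresponds to evaluating a model that is never updated.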


For a recommender running in a production system,
this process allows us to follow the evolution of the
recommender by keeping online statistics of any number of chosen
metrics. Thereby it is possible to depict how the algorithm’s
performance evolves over time. In Sec. 4 we present both
the overall average score and complement it with plots of the
evolving score using a simple moving average of an accuracy
metric.

One challenging aspect of this method is that it only
evaluates over a single item at each step, potentially failing to
recognize other possible good recommendations. If item i is
not recommended at the time the observation is made, the
score will naturally be 0. However, other items within the N
recommendations may occur in future observations for that
user. In other words, the protocol exclusively evaluates how
well the model predicts the next observation, ignoring all
subsequent ones. Although this is a strict protocol, we have
performed experiments by matching the recommended items
with not just the current, but all future observations for
each user - only possible offline - and found that overall
scores do not improve substantially. However, the strictness
of the protocol may potentially have a higher impact with
other metrics or data. One way to relax it is to match the
active observation not only with the current prediction, but
also with a set of previous predictions. Another possible
approach is to use a hybrid evaluation method such as in [22].

4. APPLYING THE METHODOLOGY 

To illustrate the usefulness of prequential evaluation in
recommender systems, we perform a set of experiments
using this protocol. We use three item recommendation
algorithms that learn recommendation models incrementally
as user feedback data becomes available. These algorithms
are designed to process positive-only feedback - also known
as binary feedback. However, we emphasize that this is not
a restriction of the evaluation protocol, since it is possible to
use the exact same methodology for rating prediction
problems as well.


This protocol provides several benefits: 

• It allows continuous monitoring of the system’s
performance over time;

• Several metrics can be captured simultaneously; 

• If available, user feedback can be included in the loop; 

• Real-time statistics can be integrated in the algorithms’ 
logic - e.g. automatic parameter adjustment, drift/shift 
detection, triggering batch retraining; 

• In ensembles, relative weights of individual algorithms 
can be adjusted; 

• The protocol is applicable to both item prediction and 
rating prediction; 

• By being applicable both online and offline,
experiments are naturally reproducible if the same data
sequence is available.

In an offline experimental setting, an overall average of
individual scores can be computed at the end - because lab
datasets are inevitably finite - and on different time
horizons.


4.1 Datasets 

We use four distinct datasets, described in Table 1. All
datasets consist of a chronologically ordered set of pairs in
the form <user, item>. Music-listen and Lastfm-600k
consist of music listening events obtained from two distinct
sources, where each tuple corresponds to a music track
being played by a user. Music-playlist consists of a
timestamped log of music track additions to personal playlists.
MovieLens-1M is a well known dataset (footnote 1) consisting
of timestamped movie ratings on a 1 to 5 rating scale. To use
this dataset in an item prediction setting, since we intend to
retain only positive feedback, movie ratings below the
maximum rating 5 are excluded. Lastfm-600k consists of the first
8 months of activity observed in the Last.fm (footnote 2)
dataset originally used in [13]. Both Music-listen and
Music-playlist are extracted from the Palco Principal website
(footnote 3), a social network dedicated to non-mainstream
music enthusiasts and artists.


1 http://www.grouplens.org, 2003 
2 http://last.fm

3 http://www.palcoprincipal.com 




[Figure 1 panels: a) Music-listen, b) Lastfm-600k]





Figure 1: Evolution of recall@10 with four datasets. The plotted lines correspond to a moving average of the 
recall@10 obtained for each prediction. The window size n of the moving average is a) n = 2000, b) n = 3000, 
c) n = 5000 and d) n = 5000. The first n points are delimited by the vertical dashed line and are plotted using 
the accumulated average. Plots a) and b) do not include repeated events in the datasets. 


Dataset          Events    Users    Items    Sparsity
Music-listen     335,731   4,768    15,323   99.90%
Lastfm-600k      493,063   164      65,013   99.11%
Music-playlist   111,942   10,392   26,117   99.96%
MovieLens-1M     226,310   6,014    3,232    98.84%

Table 1: Dataset description
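
The sparsity column of Table 1 can be cross-checked with a short calculation. The sketch below assumes the usual definition, one minus the fraction of distinct user-item pairs that are observed; this definition is an assumption, since the text does not state it. It reproduces the reported values for Music-playlist and MovieLens-1M when the event count is taken as the number of distinct pairs. For Music-listen and Lastfm-600k the event counts presumably include repeated events, so the same calculation does not apply directly.

```python
def sparsity(n_pairs, n_users, n_items):
    """Share of the user-item matrix with no observed interaction."""
    return 1.0 - n_pairs / (n_users * n_items)


# Values taken from Table 1.
movielens = round(100 * sparsity(226310, 6014, 3232), 2)
playlist = round(100 * sparsity(111942, 10392, 26117), 2)
print(movielens)  # 98.84
print(playlist)   # 99.96
```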


4.2 Algorithms used and overall accuracy 

Using the prequential approach described in Sec. 3 we
compare the online accuracy of 3 incremental item
recommendation algorithms: ISGD [23], BPRMF [19] and a classic
incremental user-based neighborhood algorithm [17] (UserKNN,
or UKNN for short), all implemented in MyMediaLite
(footnote 4). We measured accuracy with recall at cut-off
10 - denoted by recall@10.


Overall results with average update times are presented in
Tab. 2. These are possible to obtain in offline experiments,
given that lab datasets are finite. However, in online
production systems these results can only be interpreted as a
snapshot of the algorithms’ performance within a predefined
time frame.

4 http://mymedialite.net


Dataset          Algorithm   Recall@10   Update time
Music-listen     BPRMF       0.028       0.846 ms
                 ISGD        0.061       0.118 ms
                 UserKNN     0.139       328.917 ms
Lastfm-600k      BPRMF       0.003       28.061 ms
                 ISGD        0.034       1.106 ms
                 UserKNN     0.006       290.133 ms
Music-playlist   BPRMF       0.020       1.889 ms
                 ISGD        0.171       0.949 ms
                 UserKNN     0.132       190.250 ms
MovieLens-1M     BPRMF       0.080       0.173 ms
                 ISGD        0.050       0.016 ms
                 UserKNN     0.110       84.927 ms

Table 2: Overall results. Best performing algorithms
are highlighted in bold for each dataset. Update
times are the average value of the update time
for all data points.
Figure 2: Signed McNemar test of ISGD against BPRMF and UKNN. Test is computed over a sliding window
of n observations with a) n = 2000, b) n = 3000, c) n = 5000 and d) n = 5000.


4.3 Accuracy over time 

One valuable feature of our adopted evaluation protocol is
that it allows the monitoring of the learning process as it
evolves over time. To do that, we need to maintain
statistics of the outcome of the predictions. We study how the
algorithms’ accuracy evolves over time by depicting in Fig.
1 a moving average of the recall@10 metric. The moving
average sizes are chosen to obtain clear lines in Fig. 1 for
illustrative purposes. We do not argue that these values are
any better than others.
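
A moving average of the per-event scores can be maintained online, as in the following sketch. It follows the convention stated in the caption of Figure 1, where the first n points use the accumulated average because the window is not yet full. The function name and the tiny window in the example are for illustration only.

```python
from collections import deque


def moving_average(scores, window):
    """Moving average over the last `window` scores.

    While fewer than `window` scores have been seen, the value is the
    accumulated average of all scores so far.
    """
    recent = deque()
    total = 0.0
    averages = []
    for score in scores:
        recent.append(score)
        total += score
        if len(recent) > window:
            total -= recent.popleft()  # drop the oldest score
        averages.append(total / len(recent))
    return averages


print(moving_average([1, 0, 1, 1, 0], window=2))
# [1.0, 0.5, 0.5, 1.0, 0.5]
```

Only the last `window` scores are kept in memory, so the statistic can be updated in constant time per event regardless of how long the stream runs.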

The plotted evolution of the algorithms with each dataset
generally confirms overall results; however, more details
become available. For instance, although the overall averages
of ISGD and UKNN are relatively close with the
Music-playlist dataset, Fig. 1 c) shows that these algorithms
behave quite differently, starting with a very similar accuracy
level and then diverging substantially. Although this kind
of observation could be important for a rigorous evaluation,
it is diluted in a single average in Table 2.

4.4 Statistical significance over time 

We also depict in Fig. 2 statistical significance tests using
the signed McNemar test over sliding windows 0 of the 
same size as the ones used for the moving averages in
Fig. 1. We set a significance level of 1%. Because McNemar
is a pairwise test, a complete comparative assessment with
four datasets and three algorithms would require 12 tests.
However, to avoid multiple tests we can compare one
proposed algorithm with existing ones. Alternatively we can
use p-value corrections. In this illustrative experiment we
compare the ISGD algorithm with the other two on the four
datasets, which yields 8 tests. The main observation from
Fig. 2 is that most of the apparent differences in Fig.
1 are statistically significant. However, the visualization of
the McNemar test clarifies some comparisons.
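
A signed McNemar statistic over a sliding window can be sketched as follows. The formulation is an assumption, since the text does not give the formula: with a the number of windowed events where only algorithm A scores a hit and b the number where only algorithm B does, the statistic is sign(a - b) * (a - b)^2 / (a + b), compared in absolute value against 6.635, the chi-square critical value with one degree of freedom at the 1% level. The function name is hypothetical.

```python
from collections import deque

# Chi-square critical value, 1 degree of freedom, significance level 1%.
CHI2_1DOF_ALPHA_001 = 6.635


def signed_mcnemar(hits_a, hits_b, window):
    """Signed McNemar statistic after each event, over a sliding window.

    hits_a and hits_b hold 0/1 outcomes of two algorithms on the same
    events. Positive values favor A, negative values favor B.
    """
    recent = deque()
    a_only = 0  # events in the window where only A hit
    b_only = 0  # events in the window where only B hit
    statistics = []
    for a, b in zip(hits_a, hits_b):
        recent.append((a, b))
        a_only += int(a and not b)
        b_only += int(b and not a)
        if len(recent) > window:
            old_a, old_b = recent.popleft()
            a_only -= int(old_a and not old_b)
            b_only -= int(old_b and not old_a)
        diff = a_only - b_only
        discordant = a_only + b_only
        stat = diff * diff / discordant if discordant else 0.0
        statistics.append(stat if diff >= 0 else -stat)
    return statistics


stats = signed_mcnemar([1] * 10, [0] * 10, window=10)
significant = abs(stats[-1]) > CHI2_1DOF_ALPHA_001
```

Events where both algorithms hit or both miss do not affect the statistic; only the discordant events do.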


The online monitoring of the learning process allows a more 
detailed evaluation of the algorithms’ performance. Figure 
0 reveals phenomena that would otherwise be hidden in a 
typical batch evaluation. We consider that this finer grained 
evaluation process provides a deeper insight into the learning 
processes of predictive models. 

5. CONCLUSIONS 

In this paper, we propose a prequential evaluation
framework to monitor evaluation metrics of recommender
systems as they continuously learn from a data stream. To
illustrate its applicability and appropriateness we use this
framework to compare three incremental recommendation
algorithms. We notice that our evaluation method allows
a finer grained assessment of algorithms, by being able to
continuously monitor the outcome of the learning process.
Moreover, it is possible to integrate multiple measures
simultaneously in the evaluation process, thereby evaluating
multiple dimensions. We also show the applicability of
statistical significance tests.